Goto

Collaborating Authors

 resource-constrained device


Thinking Forward: Memory-Efficient Federated Finetuning of Language Models

Neural Information Processing Systems

Finetuning large language models (LLMs) in federated learning (FL) settings has become increasingly important as it allows resource-constrained devices to finetune a model using private data. However, finetuning LLMs using backpropagation requires excessive memory (especially from intermediate activations) for resource-constrained devices. While Forward-mode Auto-Differentiation (AD) can significantly reduce memory footprint from activations, we observe that directly applying it to LLM finetuning results in slow convergence and poor accuracy. In this paper, we introduce Spry, an FL algorithm that splits trainable weights of an LLM among participating clients, such that each client computes gradients using forward-mode AD that are closer estimations of the true gradients. Spry achieves a low memory footprint, high accuracy, and fast convergence. We formally prove that the global gradients in Spry are unbiased estimators of true global gradients for homogeneous data distributions across clients, while heterogeneity increases bias of the estimates. We also derive Spry's convergence rate, showing that the gradients decrease inversely proportional to the number of FL rounds, indicating the convergence up to the limits of heterogeneity. Empirically, Spry reduces the memory footprint during training by 1.4-7.1$\times$ in contrast to backpropagation, while reaching comparable accuracy, across a wide range of language tasks, models, and FL settings. Spry reduces the convergence time by 1.2-20.3$\times$


Multiple Instance Learning for Efficient Sequential Data Classification on Resource-constrained Devices

Neural Information Processing Systems

We study the problem of fast and efficient classification of sequential data (such as time-series) on tiny devices, which is critical for various IoT related applications like audio keyword detection or gesture detection. Such tasks are cast as a standard classification task by sliding windows over the data stream to construct data points. Deploying such classification modules on tiny devices is challenging as predictions over sliding windows of data need to be invoked continuously at a high frequency. Each such predictor instance in itself is expensive as it evaluates large models over long windows of data. In this paper, we address this challenge by exploiting the following two observations about classification tasks arising in typical IoT related applications: (a) the signature of a particular class (e.g. an audio keyword) typically occupies a small fraction of the overall data, and (b) class signatures tend to be discernible early on in the data. We propose a method, EMI-RNN, that exploits these observations by using a multiple instance learning formulation along with an early prediction technique to learn a model that achieves better accuracy compared to baseline models, while simultaneously reducing computation by a large fraction. For instance, on a gesture detection benchmark [ 25 ], EMI-RNN improves standard LSTM model's accuracy by up to 1% while requiring 72x less computation. This enables us to deploy such models for continuous real-time prediction on a small device such as Raspberry Pi0 and Arduino variants, a task that the baseline LSTM could not achieve. Finally, we also provide an analysis of our multiple instance learning algorithm in a simple setting and show that the proposed algorithm converges to the global optima at a linear rate, one of the first such result in this domain.


MONAQ: Multi-Objective Neural Architecture Querying for Time-Series Analysis on Resource-Constrained Devices

Trirat, Patara, Lee, Jae-Gil

arXiv.org Artificial Intelligence

The growing use of smartphones and IoT devices necessitates efficient time-series analysis on resource-constrained hardware, which is critical for sensing applications such as human activity recognition and air quality prediction. Recent efforts in hardware-aware neural architecture search (NAS) automate architecture discovery for specific platforms; however, none focus on general time-series analysis with edge deployment. Leveraging the problem-solving and reasoning capabilities of large language models (LLM), we propose MONAQ, a novel framework that reformulates NAS into Multi-Objective Neural Architecture Querying tasks. MONAQ is equipped with multimodal query generation for processing multimodal time-series inputs and hardware constraints, alongside an LLM agent-based multi-objective search to achieve deployment-ready models via code generation. By integrating numerical data, time-series images, and textual descriptions, MONAQ improves an LLM's understanding of time-series data. Experiments on fifteen datasets demonstrate that MONAQ-discovered models outperform both handcrafted models and NAS baselines while being more efficient.


Efficient Continual Learning in Keyword Spotting using Binary Neural Networks

Vu, Quynh Nguyen-Phuong, Martinez-Rau, Luciano Sebastian, Zhang, Yuxuan, Tran, Nho-Duc, Oelmann, Bengt, Magno, Michele, Bader, Sebastian

arXiv.org Artificial Intelligence

Keyword spotting (KWS) is an essential function that enables interaction with ubiquitous smart devices. However, in resource-limited devices, KWS models are often static and can thus not adapt to new scenarios, such as added keywords. To overcome this problem, we propose a Continual Learning (CL) approach for KWS built on Binary Neural Networks (BNNs). The framework leverages the reduced computation and memory requirements of BNNs while incorporating techniques that enable the seamless integration of new keywords over time. This study evaluates seven CL techniques on a 16-class use case, reporting an accuracy exceeding 95% for a single additional keyword and up to 86% for four additional classes. Sensitivity to the amount of training samples in the CL phase, and differences in computational complexities are being evaluated. These evaluations demonstrate that batch-based algorithms are more sensitive to the CL dataset size, and that differences between the computational complexities are insignificant. These findings highlight the potential of developing an effective and computationally efficient technique for continuously integrating new keywords in KWS applications that is compatible with resource-constrained devices.


SILENCE: Protecting privacy in offloaded speech understanding on resource-constrained devices

Neural Information Processing Systems

Speech serves as a ubiquitous input interface for embedded mobile devices. Cloud-based solutions, while offering powerful speech understanding services, raise significant concerns regarding user privacy. To address this, disentanglement-based encoders have been proposed to remove sensitive information from speech signals without compromising the speech understanding functionality. However, these encoders demand high memory usage and computation complexity, making them impractical for resource-constrained wimpy devices.Our solution is based on a key observation that speech understanding hinges on long-term dependency knowledge of the entire utterance, in contrast to privacy-sensitive elements that are short-term dependent. Exploiting this observation, we propose SILENCE, a lightweight system that selectively obscuring short-term details, without damaging the long-term dependent speech understanding performance.The crucial part of SILENCE is a differential mask generator derived from interpretable learning to automatically configure the masking process.We have implemented SILENCE on the STM32H7 microcontroller and evaluate its efficacy under different attacking scenarios.


Thinking Forward: Memory-Efficient Federated Finetuning of Language Models

Neural Information Processing Systems

Finetuning large language models (LLMs) in federated learning (FL) settings has become increasingly important as it allows resource-constrained devices to finetune a model using private data. However, finetuning LLMs using backpropagation requires excessive memory (especially from intermediate activations) for resource-constrained devices. While Forward-mode Auto-Differentiation (AD) can significantly reduce memory footprint from activations, we observe that directly applying it to LLM finetuning results in slow convergence and poor accuracy. In this paper, we introduce Spry, an FL algorithm that splits trainable weights of an LLM among participating clients, such that each client computes gradients using forward-mode AD that are closer estimations of the true gradients. Spry achieves a low memory footprint, high accuracy, and fast convergence.


Demo: A Practical Testbed for Decentralized Federated Learning on Physical Edge Devices

Feng, Chao, Huber, Nicolas, Celdran, Alberto Huertas, Bovet, Gerome, Stiller, Burkhard

arXiv.org Artificial Intelligence

--Federated Learning (FL) enables collaborative model training without sharing raw data, preserving participant privacy. Decentralized FL (DFL) eliminates reliance on a central server, mitigating the single point of failure inherent in the traditional FL paradigm, while introducing deployment challenges on resource-constrained devices. T o evaluate real-world applicability, this work designs and deploys a physical testbed using edge devices such as Raspberry Pi and Jetson Nano. The testbed is built upon a DFL training platform, NEBULA, and extends it with a power monitoring module to measure energy consumption during training. Experiments across multiple datasets show that model performance is influenced by the communication topology, with denser topologies leading to better outcomes in DFL settings.


Sponge Attacks on Sensing AI: Energy-Latency Vulnerabilities and Defense via Model Pruning

Hasan, Syed Mhamudul, Zangoti, Hussein, Anagnostopoulos, Iraklis, Shahid, Abdur R.

arXiv.org Artificial Intelligence

Recent studies have shown that sponge attacks can significantly increase the energy consumption and inference latency of deep neural networks (DNNs). However, prior work has focused primarily on computer vision and natural language processing tasks, overlooking the growing use of lightweight AI models in sensing-based applications on resource-constrained devices, such as those in Internet of Things (IoT) environments. These attacks pose serious threats of energy depletion and latency degradation in systems where limited battery capacity and real-time responsiveness are critical for reliable operation. This paper makes two key contributions. First, we present the first systematic exploration of energy-latency sponge attacks targeting sensing-based AI models. Using wearable sensing-based AI as a case study, we demonstrate that sponge attacks can substantially degrade performance by increasing energy consumption, leading to faster battery drain, and by prolonging inference latency. Second, to mitigate such attacks, we investigate model pruning, a widely adopted compression technique for resource-constrained AI, as a potential defense. Our experiments show that pruning-induced sparsity significantly improves model resilience against sponge poisoning. We also quantify the trade-offs between model efficiency and attack resilience, offering insights into the security implications of model compression in sensing-based AI systems deployed in IoT environments.


SplitFrozen: Split Learning with Device-side Model Frozen for Fine-Tuning LLM on Heterogeneous Resource-Constrained Devices

Ma, Jian, Lyu, Xinchen, Jiang, Jun, Cui, Qimei, Yao, Haipeng, Tao, Xiaofeng

arXiv.org Artificial Intelligence

Fine-tuning large language models (LLMs) on private, on-device data can empower tailored personalized AI agents. However, fine-tuning LLMs on resource-constrained edge devices faces significant challenges, including excessive computation overhead, device heterogeneity, and data imbalance. This paper proposes SplitFrozen, a split learning framework that enables efficient LLM fine-tuning by strategically freezing device-side model layers while centralizing parameter-efficient fine-tuning on the server. Our framework partitions LLMs into device-side frozen layers and server-side fine-tuning layers, where heterogeneous resource-constrained devices execute only forward propagation. To minimize server-side training costs, we integrate Low-Rank Adaptation (LoRA) into the server-side layers. A pipeline parallelism strategy further optimizes training efficiency by decoupling device-server computations and leveraging decomposed backward propagation. Experiments on GPT-2 with the MRPC, MNLI-matched, and SST-2 datasets demonstrate that SplitFrozen outperforms FedLoRA and SplitLoRA by 69.4\% model accuracy under extremely imbalanced data, while reducing up to 86.8\% device-side computations and 50.2\% total training time. Experiments also validate the scalability of SplitFrozen on content generation task using Llama-3.2 model on GSM8K dataset.


DQA: An Efficient Method for Deep Quantization of Deep Neural Network Activations

Hu, Wenhao, Henderson, Paul, Cano, José

arXiv.org Machine Learning

Quantization of Deep Neural Network (DNN) activations is a commonly used technique to reduce compute and memory demands during DNN inference, which can be particularly beneficial on resource-constrained devices. To achieve high accuracy, existing methods for quantizing activations rely on complex mathematical computations or perform extensive searches for the best hyper-parameters. However, these expensive operations are impractical on devices with limited computation capabilities, memory capacities, and energy budgets. Furthermore, many existing methods do not focus on sub-6-bit (or deep) quantization. To fill these gaps, in this paper we propose DQA (Deep Quantization of DNN Activations), a new method that focuses on sub-6-bit quantization of activations and leverages simple shifting-based operations and Huffman coding to be efficient and achieve high accuracy. We evaluate DQA with 3, 4, and 5-bit quantization levels and three different DNN models for two different tasks, image classification and image segmentation, on two different datasets. DQA shows significantly better accuracy (up to 29.28%) compared to the direct quantization method and the state-of-the-art NoisyQuant for sub-6-bit quantization.